Data Visualization Draft
Part A
# Installing necessary packages
!pip install seaborn kaleido plotly matplotlib scikit-learn
# Importing libraries
import pandas as pd
import numpy as np
import seaborn as sns
import zipfile
import matplotlib.pyplot as plt
import kaleido
import plotly
import plotly.express as px
import json
from IPython.display import Image
from urllib.request import urlopen
from sklearn import datasets, linear_model
# Displaying versions to ensure correct installation
print("Kaleido version:", kaleido.__version__)
print("Plotly version:", plotly.__version__)
Kaleido version: 0.2.1
Plotly version: 5.22.0
My two chosen datasets:
https://www.kaggle.com/datasets/mikejohnsonjr/united-states-crime-rates-by-county/data
This dataset contains crime figures per county in the United States. Its columns include theft, rape, murder, population, and county name. It gives detailed counts for several crime types alongside population size, which is useful for criminological research and policy-making.
number of instances: 3136
number of attributes: 24
variables:
- county_name (nominal, object, discrete, 0 missing values)
- crime_rate_per_100000 (ratio, float64, continuous, 0 missing values)
- ROBBERY (ratio, int64, discrete, 0 missing values)
- MURDER (ratio, int64, discrete, 0 missing values)
- population (ratio, int64, discrete, 0 missing values)
Question to explore:
Is there a relationship between a county's population and the frequency of certain crimes, such as rape and theft?
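One way to approach this question, sketched here on invented numbers rather than the Kaggle file: correlate `population` with the raw counts and then with per-100,000 rates, because raw counts grow with population almost by construction. The toy frame, the choice of `ROBBERY` as the theft-type column, and the `_per_100k` column names are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for the crime table (all values are invented).
toy = pd.DataFrame({
    "population": [1_000, 5_000, 20_000, 100_000, 500_000],
    "ROBBERY":    [0, 2, 15, 120, 900],
    "RAPE":       [1, 1, 6, 30, 140],
})

# Raw counts scale with population, so also derive per-capita rates.
for col in ["ROBBERY", "RAPE"]:
    toy[f"{col}_per_100k"] = toy[col] / toy["population"] * 100_000

# Pearson correlation of population with counts, then with rates.
count_corr = toy[["population", "ROBBERY", "RAPE"]].corr()["population"]
rate_corr = toy[["population", "ROBBERY_per_100k", "RAPE_per_100k"]].corr()["population"]
print(count_corr.round(2))
print(rate_corr.round(2))
```

Comparing the two outputs separates "bigger counties record more crimes" from "bigger counties have a higher crime rate", which are different claims.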
https://www.kaggle.com/datasets/muonneutrino/us-census-demographic-data
This dataset contains data per county in the United States. Its columns include ethnic origin (Asian, Hispanic), unemployment, people working from home, total population, and income per capita. It mainly covers socioeconomic and demographic figures.
number of instances: 3142
number of attributes: 37
variables:
- County (nominal, object, discrete, 0 missing values)
- Employed (ratio, int64, discrete, 0 missing values)
- Men (ratio, int64, discrete, 0 missing values)
- Hispanic (ratio, float64, continuous, 0 missing values)
- TotalPop (ratio, int64, discrete, 0 missing values)
Question to explore:
How does the employment situation differ between ethnic groups, such as Hispanic, Asian, and Black, across counties in the United States?
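The census table only gives each group's share of a county's population, not employment per group, so a rough first look is to correlate the share columns with `Unemployment`. The five rows below are the Alabama counties shown in the census table of this notebook; with so few rows the result is only illustrative, and a county-level correlation does not by itself say anything about individuals.

```python
import pandas as pd

# Five Alabama counties as displayed in the census table of this notebook.
rows = pd.DataFrame({
    "County": ["Autauga", "Baldwin", "Barbour", "Bibb", "Blount"],
    "Hispanic": [2.7, 4.4, 4.2, 2.4, 9.0],
    "White": [75.4, 83.1, 45.7, 74.6, 87.4],
    "Black": [18.9, 9.5, 47.8, 22.0, 1.5],
    "Unemployment": [5.2, 5.5, 12.4, 8.2, 4.9],
})

# Pearson correlation of each share column with the unemployment rate.
share_cols = ["Hispanic", "White", "Black"]
share_vs_unemployment = rows[share_cols].corrwith(rows["Unemployment"])
print(share_vs_unemployment.round(2))
```

On the full table the same call would run over all counties; adding `Asian`, `Native`, and `Pacific` to `share_cols` is a natural extension.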
https://www.openintro.org/data/?data=county_complete
https://www.kaggle.com/code/stefancomanita/american-statistics-visualized-on-maps-w-plotly
- Relevant (based on what you were taught in class) descriptive statistics for the five variables chosen above. Exclude missing values when calculating descriptive statistics. You do not have to report kurtosis and skewness.
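A minimal sketch of how such a summary could be computed with pandas. The `population` and `ROBBERY` values are taken from the first rows of the crime table, except that one `ROBBERY` value is replaced by a missing value on purpose to show the handling; the helper name `describe_numeric` is hypothetical.

```python
import numpy as np
import pandas as pd

# Sample rows; the NaN is inserted deliberately for illustration.
sample = pd.DataFrame({
    "population": [412, 7629, 27083, 49746, 318416],
    "ROBBERY": [1, 5, 17, 165, np.nan],
})

def describe_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Return count, mean, median, std, min, max and range per column.

    pandas reductions skip missing values by default (skipna=True),
    so NaN rows are excluded from every statistic.
    """
    stats = df.agg(["count", "mean", "median", "std", "min", "max"]).T
    stats["range"] = stats["max"] - stats["min"]
    return stats

summary = describe_numeric(sample)
print(summary)
```

For a nominal column such as `county_name`, the count of unique values and the mode are more meaningful than a mean.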
zip_file_path = r'/Users/jetzeeveleens/Downloads/crime.zip'
# List the archive contents, then extract the CSV to /tmp
with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
    file_list = zip_ref.namelist()
    print(file_list)
    csv_file_name = 'crime_data_w_population_and_crime_rate.csv'
    zip_ref.extract(csv_file_name, '/tmp')
csv_file_path = '/tmp/' + csv_file_name
crime = pd.read_csv(csv_file_path)
display(crime)
['crime_data_w_population_and_crime_rate.csv']
| county_name | crime_rate_per_100000 | index | EDITION | PART | IDNO | CPOPARST | CPOPCRIM | AG_ARRST | AG_OFF | ... | RAPE | ROBBERY | AGASSLT | BURGLRY | LARCENY | MVTHEFT | ARSON | population | FIPS_ST | FIPS_CTY | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | St. Louis city, MO | 1791.995377 | 1 | 1 | 4 | 1612 | 318667 | 318667 | 15 | 15 | ... | 200 | 1778 | 3609 | 4995 | 13791 | 3543 | 464 | 318416 | 29 | 510 |
| 1 | Crittenden County, AR | 1754.914968 | 2 | 1 | 4 | 130 | 50717 | 50717 | 4 | 4 | ... | 38 | 165 | 662 | 1482 | 1753 | 189 | 28 | 49746 | 5 | 35 |
| 2 | Alexander County, IL | 1664.700485 | 3 | 1 | 4 | 604 | 8040 | 8040 | 2 | 2 | ... | 2 | 5 | 119 | 82 | 184 | 12 | 2 | 7629 | 17 | 3 |
| 3 | Kenedy County, TX | 1456.310680 | 4 | 1 | 4 | 2681 | 444 | 444 | 1 | 1 | ... | 3 | 1 | 2 | 5 | 4 | 4 | 0 | 412 | 48 | 261 |
| 4 | De Soto Parish, LA | 1447.402430 | 5 | 1 | 4 | 1137 | 26971 | 26971 | 3 | 3 | ... | 4 | 17 | 368 | 149 | 494 | 60 | 0 | 27083 | 22 | 31 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3131 | Ohio County, IN | 0.000000 | 3132 | 1 | 4 | 762 | 6084 | 0 | 2 | 1 | ... | 0 | 0 | 0 | 2 | 2 | 0 | 0 | 5994 | 18 | 115 |
| 3132 | Newton County, MS | 0.000000 | 3133 | 1 | 4 | 1465 | 21545 | 3346 | 3 | 1 | ... | 0 | 0 | 0 | 4 | 0 | 1 | 0 | 21689 | 28 | 101 |
| 3133 | Jerauld County, SD | 0.000000 | 3134 | 1 | 4 | 2424 | 2108 | 2108 | 1 | 1 | ... | 0 | 0 | 0 | 1 | 3 | 1 | 0 | 2066 | 46 | 73 |
| 3134 | Cimarron County, OK | 0.000000 | 3135 | 1 | 4 | 2167 | 2502 | 2502 | 2 | 2 | ... | 0 | 0 | 0 | 1 | 2 | 0 | 0 | 2335 | 40 | 25 |
| 3135 | Lawrence County, MS | 0.000000 | 3136 | 1 | 4 | 1453 | 12714 | 0 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 12514 | 28 | 77 |
3136 rows × 24 columns
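As an alternative to extracting into `/tmp`, the CSV can be read straight from the archive. The sketch below builds a tiny in-memory zip so it runs without the Kaggle download; the file name `demo.csv` and its single row are placeholders.

```python
import io
import zipfile

import pandas as pd

# Build a small archive in memory (stand-in for crime.zip).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr(
        "demo.csv",
        'county_name,population\n"Kenedy County, TX",412\n',
    )

# Open the member as a file-like object and hand it to pandas directly.
buffer.seek(0)
with zipfile.ZipFile(buffer) as archive:
    with archive.open("demo.csv") as handle:
        demo = pd.read_csv(handle)

print(demo)
```

With the real archive, the same pattern would use `zipfile.ZipFile(zip_file_path)` and `archive.open(csv_file_name)`, leaving no extracted copy behind.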
Second dataset
zip_file_path = r'/Users/jetzeeveleens/Downloads/census.zip'
csv_file_name = 'acs2017_county_data.csv'
with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
    zip_ref.extract(csv_file_name, '/tmp')
csv_file_path = '/tmp/' + csv_file_name
census = pd.read_csv(csv_file_path)
# Exclude the Puerto Rico rows
census = census.drop(census[census["State"] == "Puerto Rico"].index)
display(census)
| CountyId | State | County | TotalPop | Men | Women | Hispanic | White | Black | Native | ... | Walk | OtherTransp | WorkAtHome | MeanCommute | Employed | PrivateWork | PublicWork | SelfEmployed | FamilyWork | Unemployment | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1001 | Alabama | Autauga County | 55036 | 26899 | 28137 | 2.7 | 75.4 | 18.9 | 0.3 | ... | 0.6 | 1.3 | 2.5 | 25.8 | 24112 | 74.1 | 20.2 | 5.6 | 0.1 | 5.2 |
| 1 | 1003 | Alabama | Baldwin County | 203360 | 99527 | 103833 | 4.4 | 83.1 | 9.5 | 0.8 | ... | 0.8 | 1.1 | 5.6 | 27.0 | 89527 | 80.7 | 12.9 | 6.3 | 0.1 | 5.5 |
| 2 | 1005 | Alabama | Barbour County | 26201 | 13976 | 12225 | 4.2 | 45.7 | 47.8 | 0.2 | ... | 2.2 | 1.7 | 1.3 | 23.4 | 8878 | 74.1 | 19.1 | 6.5 | 0.3 | 12.4 |
| 3 | 1007 | Alabama | Bibb County | 22580 | 12251 | 10329 | 2.4 | 74.6 | 22.0 | 0.4 | ... | 0.3 | 1.7 | 1.5 | 30.0 | 8171 | 76.0 | 17.4 | 6.3 | 0.3 | 8.2 |
| 4 | 1009 | Alabama | Blount County | 57667 | 28490 | 29177 | 9.0 | 87.4 | 1.5 | 0.3 | ... | 0.4 | 0.4 | 2.1 | 35.0 | 21380 | 83.9 | 11.9 | 4.0 | 0.1 | 4.9 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3137 | 56037 | Wyoming | Sweetwater County | 44527 | 22981 | 21546 | 16.0 | 79.6 | 0.8 | 0.6 | ... | 2.8 | 1.3 | 1.5 | 20.5 | 22739 | 78.4 | 17.8 | 3.8 | 0.0 | 5.2 |
| 3138 | 56039 | Wyoming | Teton County | 22923 | 12169 | 10754 | 15.0 | 81.5 | 0.5 | 0.3 | ... | 11.7 | 3.8 | 5.7 | 14.3 | 14492 | 82.1 | 11.4 | 6.5 | 0.0 | 1.3 |
| 3139 | 56041 | Wyoming | Uinta County | 20758 | 10593 | 10165 | 9.1 | 87.7 | 0.1 | 0.9 | ... | 1.1 | 1.3 | 2.0 | 19.9 | 9528 | 71.5 | 21.5 | 6.6 | 0.4 | 6.4 |
| 3140 | 56043 | Wyoming | Washakie County | 8253 | 4118 | 4135 | 14.2 | 82.2 | 0.3 | 0.4 | ... | 6.9 | 1.3 | 4.4 | 14.3 | 3833 | 69.8 | 22.0 | 8.1 | 0.2 | 6.1 |
| 3141 | 56045 | Wyoming | Weston County | 7117 | 3756 | 3361 | 1.4 | 91.6 | 0.5 | 0.1 | ... | 3.0 | 1.6 | 6.9 | 25.7 | 3407 | 68.2 | 21.9 | 8.8 | 1.1 | 2.2 |
3142 rows × 37 columns
# Load the county boundaries (GeoJSON keyed by FIPS code) for the maps
with urlopen('https://raw.githubusercontent.com/plotly/datasets/master/geojson-counties-fips.json') as response:
    counties = json.load(response)
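# Count unique county names in each dataset
nee = crime['county_name'].unique()
print(len(nee))
# census 'County' has no state suffix, so names shared across states collapse
ja = census['County'].unique()
print(len(ja))
# There are 3,142 counties in the United States
3136
1877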
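The gap between the two counts suggests that a bare county name is not a usable join key, which is why the notebook goes on to build a FIPS code. A small sketch of how ambiguous names could be detected, on a three-row toy frame:

```python
import pandas as pd

# Toy rows: "Washington County" exists in more than one state.
toy_census = pd.DataFrame({
    "State": ["Alabama", "Georgia", "Alabama"],
    "County": ["Washington County", "Washington County", "Bibb County"],
})

# Names that occur more than once cannot identify a county on their own.
name_counts = toy_census["County"].value_counts()
ambiguous = name_counts[name_counts > 1].index.tolist()
print(ambiguous)

# The (State, County) pair is unique even when the name alone is not.
unique_pairs = toy_census.drop_duplicates(["State", "County"]).shape[0]
print(unique_pairs)
```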
def convertToFipsForCensus(row):
    # Left-pad CountyId to a 5-character FIPS string
    countyId = row["CountyId"]
    if countyId >= 10000:
        return str(countyId)
    return "0" + str(countyId)
census["fips"] = census.apply(lambda row: convertToFipsForCensus(row), axis = 1)
census.head()
| CountyId | State | County | TotalPop | Men | Women | Hispanic | White | Black | Native | ... | OtherTransp | WorkAtHome | MeanCommute | Employed | PrivateWork | PublicWork | SelfEmployed | FamilyWork | Unemployment | fips | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1001 | Alabama | Autauga County | 55036 | 26899 | 28137 | 2.7 | 75.4 | 18.9 | 0.3 | ... | 1.3 | 2.5 | 25.8 | 24112 | 74.1 | 20.2 | 5.6 | 0.1 | 5.2 | 01001 |
| 1 | 1003 | Alabama | Baldwin County | 203360 | 99527 | 103833 | 4.4 | 83.1 | 9.5 | 0.8 | ... | 1.1 | 5.6 | 27.0 | 89527 | 80.7 | 12.9 | 6.3 | 0.1 | 5.5 | 01003 |
| 2 | 1005 | Alabama | Barbour County | 26201 | 13976 | 12225 | 4.2 | 45.7 | 47.8 | 0.2 | ... | 1.7 | 1.3 | 23.4 | 8878 | 74.1 | 19.1 | 6.5 | 0.3 | 12.4 | 01005 |
| 3 | 1007 | Alabama | Bibb County | 22580 | 12251 | 10329 | 2.4 | 74.6 | 22.0 | 0.4 | ... | 1.7 | 1.5 | 30.0 | 8171 | 76.0 | 17.4 | 6.3 | 0.3 | 8.2 | 01007 |
| 4 | 1009 | Alabama | Blount County | 57667 | 28490 | 29177 | 9.0 | 87.4 | 1.5 | 0.3 | ... | 0.4 | 2.1 | 35.0 | 21380 | 83.9 | 11.9 | 4.0 | 0.1 | 4.9 | 01009 |
5 rows × 38 columns
def createFipsForCrime(row):
    # Combine a 2-digit state code and a 3-digit county code into a FIPS string
    cityFips = str(row["FIPS_CTY"])
    stateFips = str(row["FIPS_ST"])
    if len(cityFips) == 1:
        cityFips = "00" + cityFips
    if len(cityFips) == 2:
        cityFips = "0" + cityFips
    if len(stateFips) == 1:
        stateFips = "0" + stateFips
    return stateFips + cityFips
crime["fips"] = crime.apply(lambda row: createFipsForCrime(row), axis = 1)
crime.head()
| county_name | crime_rate_per_100000 | index | EDITION | PART | IDNO | CPOPARST | CPOPCRIM | AG_ARRST | AG_OFF | ... | ROBBERY | AGASSLT | BURGLRY | LARCENY | MVTHEFT | ARSON | population | FIPS_ST | FIPS_CTY | fips | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | St. Louis city, MO | 1791.995377 | 1 | 1 | 4 | 1612 | 318667 | 318667 | 15 | 15 | ... | 1778 | 3609 | 4995 | 13791 | 3543 | 464 | 318416 | 29 | 510 | 29510 |
| 1 | Crittenden County, AR | 1754.914968 | 2 | 1 | 4 | 130 | 50717 | 50717 | 4 | 4 | ... | 165 | 662 | 1482 | 1753 | 189 | 28 | 49746 | 5 | 35 | 05035 |
| 2 | Alexander County, IL | 1664.700485 | 3 | 1 | 4 | 604 | 8040 | 8040 | 2 | 2 | ... | 5 | 119 | 82 | 184 | 12 | 2 | 7629 | 17 | 3 | 17003 |
| 3 | Kenedy County, TX | 1456.310680 | 4 | 1 | 4 | 2681 | 444 | 444 | 1 | 1 | ... | 1 | 2 | 5 | 4 | 4 | 0 | 412 | 48 | 261 | 48261 |
| 4 | De Soto Parish, LA | 1447.402430 | 5 | 1 | 4 | 1137 | 26971 | 26971 | 3 | 3 | ... | 17 | 368 | 149 | 494 | 60 | 0 | 27083 | 22 | 31 | 22031 |
5 rows × 25 columns
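The same zero-padding can be written without a row-wise `apply`, using the pandas string method `str.zfill`. This is an equivalent sketch, not the notebook's own code; the three rows reuse the `FIPS_ST` and `FIPS_CTY` values shown in the table above.

```python
import pandas as pd

codes = pd.DataFrame({"FIPS_ST": [29, 5, 17], "FIPS_CTY": [510, 35, 3]})

# Pad the state code to 2 digits and the county code to 3, then concatenate.
codes["fips"] = (
    codes["FIPS_ST"].astype(str).str.zfill(2)
    + codes["FIPS_CTY"].astype(str).str.zfill(3)
)
print(codes["fips"].tolist())  # ['29510', '05035', '17003']
```

For the census table the analogous one-liner would be `census["CountyId"].astype(str).str.zfill(5)`.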
crime_census = census.merge(crime, how="left", on="fips")
display(crime_census.head())
display(crime_census.shape)
print('----------------------------------------')
display(crime_census.isnull().sum())
print('--------------------------------------------------------------------------------')
print(crime_census.columns)
| CountyId | State | County | TotalPop | Men | Women | Hispanic | White | Black | Native | ... | RAPE | ROBBERY | AGASSLT | BURGLRY | LARCENY | MVTHEFT | ARSON | population | FIPS_ST | FIPS_CTY | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1001 | Alabama | Autauga County | 55036 | 26899 | 28137 | 2.7 | 75.4 | 18.9 | 0.3 | ... | 15.0 | 34.0 | 87.0 | 447.0 | 1233.0 | 85.0 | 108.0 | 55246.0 | 1.0 | 1.0 |
| 1 | 1003 | Alabama | Baldwin County | 203360 | 99527 | 103833 | 4.4 | 83.1 | 9.5 | 0.8 | ... | 30.0 | 76.0 | 332.0 | 967.0 | 3829.0 | 192.0 | 31.0 | 195540.0 | 1.0 | 3.0 |
| 2 | 1005 | Alabama | Barbour County | 26201 | 13976 | 12225 | 4.2 | 45.7 | 47.8 | 0.2 | ... | 4.0 | 8.0 | 36.0 | 90.0 | 362.0 | 21.0 | 0.0 | 27076.0 | 1.0 | 5.0 |
| 3 | 1007 | Alabama | Bibb County | 22580 | 12251 | 10329 | 2.4 | 74.6 | 22.0 | 0.4 | ... | 4.0 | 8.0 | 36.0 | 122.0 | 251.0 | 27.0 | 0.0 | 22512.0 | 1.0 | 7.0 |
| 4 | 1009 | Alabama | Blount County | 57667 | 28490 | 29177 | 9.0 | 87.4 | 1.5 | 0.3 | ... | 11.0 | 9.0 | 101.0 | 397.0 | 865.0 | 86.0 | 9.0 | 57872.0 | 1.0 | 9.0 |
5 rows × 62 columns
(3142, 62)
----------------------------------------
CountyId 0
State 0
County 0
TotalPop 0
Men 0
..
MVTHEFT 9
ARSON 9
population 9
FIPS_ST 9
FIPS_CTY 9
Length: 62, dtype: int64
--------------------------------------------------------------------------------
Index(['CountyId', 'State', 'County', 'TotalPop', 'Men', 'Women', 'Hispanic',
'White', 'Black', 'Native', 'Asian', 'Pacific', 'VotingAgeCitizen',
'Income', 'IncomeErr', 'IncomePerCap', 'IncomePerCapErr', 'Poverty',
'ChildPoverty', 'Professional', 'Service', 'Office', 'Construction',
'Production', 'Drive', 'Carpool', 'Transit', 'Walk', 'OtherTransp',
'WorkAtHome', 'MeanCommute', 'Employed', 'PrivateWork', 'PublicWork',
'SelfEmployed', 'FamilyWork', 'Unemployment', 'fips', 'county_name',
'crime_rate_per_100000', 'index', 'EDITION', 'PART', 'IDNO', 'CPOPARST',
'CPOPCRIM', 'AG_ARRST', 'AG_OFF', 'COVIND', 'INDEX', 'MODINDX',
'MURDER', 'RAPE', 'ROBBERY', 'AGASSLT', 'BURGLRY', 'LARCENY', 'MVTHEFT',
'ARSON', 'population', 'FIPS_ST', 'FIPS_CTY'],
dtype='object')
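The null counts above show 9 census counties without a matching crime row after the left merge. To see which ones, `merge` accepts `indicator=True`, which adds a `_merge` column. The sketch uses two toy frames; the third `fips` code and its population are made up.

```python
import pandas as pd

left = pd.DataFrame({
    "fips": ["01001", "01003", "02158"],   # last code is hypothetical
    "TotalPop": [55036, 203360, 8000],
})
right = pd.DataFrame({
    "fips": ["01001", "01003"],
    "population": [55246, 195540],
})

# indicator=True labels each row as 'both', 'left_only' or 'right_only'.
checked = left.merge(right, how="left", on="fips", indicator=True)
unmatched = checked.loc[checked["_merge"] == "left_only", "fips"].tolist()
print(unmatched)
```

Applied to `census.merge(crime, ...)`, the `left_only` rows would list the counties that later appear as gaps on the maps.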
def showMap(df: pd.DataFrame, counties, target: str, colorscheme=px.colors.diverging.PRGn, min=5, max=50):
    # Draw a US county choropleth of `target`, clipping colors to (min, max)
    fig = px.choropleth(df, geojson=counties, locations='fips', color=target,
                        range_color=(min, max),
                        scope="usa",
                        labels={target: f'{target}'},
                        color_continuous_midpoint=((max - min) / 3) + min,
                        color_continuous_scale=colorscheme
                        )
    fig.update_layout(margin={"r": 0, "t": 0, "l": 0, "b": 0})
    return fig
minCrimeRate = crime_census["crime_rate_per_100000"].min()
maxCrimeRate = crime_census["crime_rate_per_100000"].max()
print(minCrimeRate, maxCrimeRate)
0.0 1791.995377
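The crime rate spans 0 to about 1792, so a color range taken from the raw minimum and maximum would be dominated by a few outliers. One option, sketched here as an assumption rather than what the notebook does, is to derive the `min` and `max` arguments for `showMap` from quantiles. The helper name `robust_range` and the sample values are illustrative.

```python
import pandas as pd

def robust_range(values: pd.Series, lower: float = 0.05, upper: float = 0.95):
    """Return (low, high) quantiles of the non-missing values."""
    clean = values.dropna()
    return float(clean.quantile(lower)), float(clean.quantile(upper))

# Invented rates with one extreme value and one missing value.
rates = pd.Series([0.0, 50.0, 120.0, 200.0, 260.0, 310.0, 400.0, 1791.995377, None])
low, high = robust_range(rates)
print(low, high)
```

The result could then be passed on as `showMap(..., min=low, max=high)`; the quantile levels are a judgment call and worth adjusting per variable.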
crime_census['black_hispanic'] = crime_census['Black'] + crime_census['Hispanic']
crimes_per = showMap(crime_census, counties, "crime_rate_per_100000", min=0, max=400, colorscheme='Reds')
crimes_per.show()
voting_age = showMap(crime_census, counties, "VotingAgeCitizen", min=0, max=70000, colorscheme='emrld')
voting_age.show()
print(crime_census['VotingAgeCitizen'].max(), crime_census['VotingAgeCitizen'].min())
6218279 59
total_pop = showMap(crime_census, counties, "TotalPop", min=0, max=75000, colorscheme='Greens')
total_pop.show()
print(crime_census['TotalPop'].max(), crime_census['TotalPop'].min())
10105722 74
unemployment_map = showMap(crime_census, counties, "Unemployment", min=0, max=10, colorscheme='greys')
unemployment_map.show()
print(crime_census['Unemployment'].max(), crime_census['Unemployment'].min())
28.8 0.0
poverty_map = showMap(crime_census, counties, "Poverty", min=0, max=25, colorscheme='sunset')
poverty_map.show()
print(crime_census['Poverty'].max(), crime_census['Poverty'].min())
52.0 2.4
income_cap = showMap(crime_census, counties, "IncomePerCap", min=9000, max=25000, colorscheme='mint')
income_cap.show()
print(crime_census['IncomePerCap'].max(), crime_census['IncomePerCap'].min())
69529 9334
# Drop rows with missing values in 'crime_rate_per_100000' or 'ChildPoverty'
crime_census_cleaned = crime_census.dropna(subset=['crime_rate_per_100000', 'ChildPoverty'])
from sklearn.linear_model import LinearRegression
plt.figure(figsize=(10, 6))
sns.scatterplot(data=crime_census_cleaned, x='ChildPoverty', y='crime_rate_per_100000', marker='o', color='b')
plt.title('Crime rate per 100,000 vs. child poverty')
plt.xlabel('Child poverty')
plt.ylabel('Crime rate per 100,000')
x = crime_census_cleaned[['ChildPoverty']]
y = crime_census_cleaned['crime_rate_per_100000']
# Fit a simple linear regression and overlay the fitted line in red
reg = LinearRegression()
reg.fit(x, y)
predictions = reg.predict(x)
plt.plot(x, predictions, color='r')
plt.show()
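The plot shows the fitted line but not its numbers. A sketch of how slope, intercept, and R² could be reported, written with NumPy on five invented points so it runs without the merged table:

```python
import numpy as np

# Invented points: x stands in for child poverty, y for crime rate.
x = np.array([10.0, 15.0, 20.0, 25.0, 30.0])
y = np.array([150.0, 210.0, 240.0, 310.0, 340.0])

# Least-squares line y = slope * x + intercept.
slope, intercept = np.polyfit(x, y, deg=1)

# R^2: share of the variance in y explained by the fitted line.
predicted = slope * x + intercept
ss_res = np.sum((y - predicted) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot
print(round(slope, 2), round(intercept, 2), round(r_squared, 3))
```

With the scikit-learn model fitted above, the corresponding values are available as `reg.coef_`, `reg.intercept_`, and `reg.score(x, y)`.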
correlation = crime_census['Income'].corr(crime_census['crime_rate_per_100000'], method='pearson')
print("Correlation between income and crime rate:", round(correlation,2))
Correlation between income and crime rate: -0.14
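Pearson's coefficient measures linear association and is sensitive to outliers, so a rank-based (Spearman) coefficient is a useful cross-check. The sketch below uses invented values; it drops incomplete pairs first and then computes Spearman as Pearson on ranks, which avoids any extra dependency.

```python
import numpy as np
import pandas as pd

# Invented values; the last income is missing on purpose.
pairs = pd.DataFrame({
    "income": [30000, 42000, 51000, 58000, 75000, np.nan],
    "crime_rate": [400.0, 310.0, 330.0, 250.0, 120.0, 90.0],
}).dropna()

# Linear association.
pearson = pairs["income"].corr(pairs["crime_rate"])

# Monotonic association: Pearson correlation of the ranks.
spearman = pairs["income"].rank().corr(pairs["crime_rate"].rank())
print(round(pearson, 2), round(spearman, 2))
```

If the two coefficients differ a lot on the real data, that hints at a non-linear relationship or a few influential counties.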
Part B
# Download the COVID-19 dataset
covid_url = "https://covid.ourworldindata.org/data/owid-covid-data.csv"
covid_df = pd.read_csv(covid_url)
# Convert the date column to datetime format
covid_df['date'] = pd.to_datetime(covid_df['date'])
# Filter data for the year 2023
covid_df_2023 = covid_df[covid_df['date'].dt.year == 2023]
# Sort by date to ensure the latest data is selected for each country
covid_df_sorted = covid_df_2023.sort_values('date')
# Group by country; GroupBy.last() takes the last non-null value of each column, not the last row
covid_df_aggregated = covid_df_sorted.groupby('location').last().reset_index()
# Select relevant columns
covid_selected_columns = ['location', 'total_cases', 'total_deaths', 'total_vaccinations', 'population']
covid_df_aggregated = covid_df_aggregated[covid_selected_columns]
# Rename columns for clarity
covid_df_aggregated.rename(columns={'location': 'country'}, inplace=True)
# Display the aggregated DataFrame
display(covid_df_aggregated)
| country | total_cases | total_deaths | total_vaccinations | population | |
|---|---|---|---|---|---|
| 0 | Afghanistan | 230375.0 | 7973.0 | 2.296475e+07 | 4.112877e+07 |
| 1 | Africa | 13133432.0 | 259066.0 | 8.632379e+08 | 1.426737e+09 |
| 2 | Albania | 334596.0 | 3604.0 | 3.088966e+06 | 2.842318e+06 |
| 3 | Algeria | 272010.0 | 6881.0 | NaN | 4.490323e+07 |
| 4 | American Samoa | 8359.0 | 34.0 | NaN | 4.429500e+04 |
| ... | ... | ... | ... | ... | ... |
| 248 | Wallis and Futuna | 3550.0 | 8.0 | 1.805800e+04 | 1.159600e+04 |
| 249 | World | 773948532.0 | 7015947.0 | 1.357576e+10 | 7.975105e+09 |
| 250 | Yemen | 11945.0 | 2159.0 | 1.298654e+06 | 3.369661e+07 |
| 251 | Zambia | 349304.0 | 4069.0 | 1.345421e+07 | 2.001767e+07 |
| 252 | Zimbabwe | 266071.0 | 5731.0 | NaN | 1.632054e+07 |
253 rows × 5 columns
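The table above mixes countries with aggregates such as "Africa" and "World", which would distort any per-country statistic. To my understanding the OWID file marks aggregates with an `iso_code` starting with `OWID_`; that prefix rule is an assumption to verify against the actual file (some non-aggregate entries may share the prefix). A sketch on four rows taken from the table, with `iso_code` values added:

```python
import pandas as pd

toy = pd.DataFrame({
    "iso_code": ["AFG", "OWID_AFR", "ALB", "OWID_WRL"],  # assumed codes
    "location": ["Afghanistan", "Africa", "Albania", "World"],
    "total_cases": [230375.0, 13133432.0, 334596.0, 773948532.0],
})

# Flag rows whose code carries the assumed aggregate prefix, then drop them.
is_aggregate = toy["iso_code"].str.startswith("OWID_")
countries_only = toy.loc[~is_aggregate].reset_index(drop=True)
print(countries_only)
```

In the notebook this filter would need to run before the column selection, since `iso_code` is not among the selected columns.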
df2 = pd.read_csv('world_population.csv')
display(df2.head(), df2.isnull().sum())
FileNotFoundError                         Traceback (most recent call last)
Cell In[29], line 1
----> 1 df2 = pd.read_csv('world_population.csv')
      2 display(df2.head(), df2.isnull().sum())
...
FileNotFoundError: [Errno 2] No such file or directory: 'world_population.csv'
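The `FileNotFoundError` means `world_population.csv` was not in the notebook's working directory, so the cells that follow cannot run until the file is in place. A small sketch of checking for the file up front; the helper `find_csv` and the list of folders to search are hypothetical.

```python
from pathlib import Path
from typing import List, Optional

def find_csv(name: str, search_dirs: List[Path]) -> Optional[Path]:
    """Return the first existing path to `name`, or None if absent."""
    for folder in search_dirs:
        candidate = folder / name
        if candidate.is_file():
            return candidate
    return None

# Look in the working directory first, then in the Downloads folder.
path = find_csv("world_population.csv", [Path.cwd(), Path.home() / "Downloads"])
if path is None:
    print("world_population.csv not found; place it next to the notebook first")
else:
    print("Found:", path)
```

When a path is returned, it can be passed to `pd.read_csv(path)` in place of the bare file name.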
# Merge COVID totals with the world population table on country name
df3 = covid_df_aggregated.merge(df2, left_on='country', right_on='Country/Territory', how='inner')
display(df3.head())
# Pearson correlation of total cases with density, area, and population
correlation = df3['total_cases'].corr(df3['Density (per km²)'], method='pearson')
print("Pearson's correlation coefficient (total_cases vs. density):", round(correlation,2))
correlation = df3['total_cases'].corr(df3['Area (km²)'], method='pearson')
print("Pearson's correlation coefficient (total_cases vs. area):", round(correlation,2))
correlation = df3['total_cases'].corr(df3['population'], method='pearson')
print("Pearson's correlation coefficient (total_cases vs. population):", round(correlation,2))
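Total case counts grow with population, so a correlation between `total_cases` and `population` is partly built in. Normalizing to cases per 100,000 inhabitants makes countries comparable. The sketch uses four rows from the aggregated table above; the column name `cases_per_100k` is an illustrative choice.

```python
import pandas as pd

# Rows copied from the aggregated COVID table in this notebook.
toy = pd.DataFrame({
    "country": ["Afghanistan", "Albania", "Yemen", "Zambia"],
    "total_cases": [230375.0, 334596.0, 11945.0, 349304.0],
    "population": [4.112877e7, 2.842318e6, 3.369661e7, 2.001767e7],
})

# Cases per 100,000 inhabitants.
toy["cases_per_100k"] = toy["total_cases"] / toy["population"] * 100_000
print(toy[["country", "cases_per_100k"]].round(0))
```

Correlating `cases_per_100k` with density or area, rather than `total_cases`, would then test the relationship without the population effect.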